IntroductionΒΆ

Companies invest significant resources in marketing campaigns to convert prospective customers into current ones. This process is essential for maintaining revenue levels, reinvesting to improve service quality, responding to shareholders, and more. Therefore, understanding why customers leave and predicting churn with enough anticipation to implement a retention plan is an integral part of business operations. With this in mind, this kernel will focus on exploring the available data to identify potential causes of customer churn, laying the groundwork for future machine learning models.

Instead of using Matplotlib or Seaborn, I chose Plotly for this kernel due to its excellent interactivity. Additionally, Plotly, when combined with Dash, provides tools to build web applications with interactive dashboards.

I hope you find this kernel helpful. If you have any questions or suggestions for improvement, please feel free to open an issue or start a discussion.

RequirementsΒΆ

  • Pandas version: 1.5.3
  • NumPy version: 1.24.4
  • SciPy version: 1.10.1
  • Matplotlib version: 3.7.0
  • Missingno version: 0.5.2
  • Plotly version: 5.13.1

Import Libraries librariesΒΆ

Utility functionsΒΆ

resumetable() | expanded_describe():ΒΆ

spot_missing_values():ΒΆ

reshape_df()ΒΆ

custom_bar_chart()ΒΆ

custome_histograms()ΒΆ

costum_boxplot()ΒΆ

custom_line_chart()ΒΆ

serv_dist_in_num_servs()ΒΆ

Understanding the dataΒΆ

Loading datasetΒΆ

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

Quick view to the dataset.ΒΆ

customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
5398 3936-QQFLL Male 0 No No 2 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Month-to-month No Mailed check 19.75 39.3 No
1023 7460-ITWWP Female 1 Yes No 45 Yes No Fiber optic No No Yes No No Yes Month-to-month Yes Electronic check 86.10 3861.45 No
3080 1777-JYQPJ Male 0 No No 2 No No phone service DSL No No No No No No Month-to-month No Mailed check 24.30 38.45 No

General Info about Dataset:ΒΆ

More info could be found in the following link:

  • https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

Column descriptionΒΆ

Demographic info:ΒΆ

  • gender
  • SeniorCitizen: Indicates if the customer is under 65 or older: Yes, No
  • Partner
  • Dependents

Company services:ΒΆ

  • PhoneService,
  • Multiple lines,
  • Internet service,
  • Online security,
  • Online backup,
  • Device protection,
  • Tech support,
  • Streaming TV,
  • Streaming movies

Customer account information:ΒΆ

  • CustomerID
  • Contract
  • Ternure
  • Paperless billing
  • Payment method
  • Monthly charges
  • Total charges

Additional considerations:ΒΆ

Column tenure indicates the months that a particular customer has been with the company by the end of the quarter (The fiscal quarter that the data has been derived from (e.g. Q3)). I will assume that all the customer cancelation dates are the same

column_name dtypes unique2 count top2 freq2 mean std min 25% 50% 75% max First Value Second Value Third Value Entropy
0 customerID object 7043 7043.0 0002-ORFBO 1 - - - - - - - 7590-VHVEG 5575-GNVDE 3668-QPYBK 3.85
1 gender object 2 7043.0 Male 3555 - - - - - - - Female Male Male 0.30
2 SeniorCitizen int64 2 7043.0 0.0 5901 0.162147 0.368612 0.0 0.0 0.0 0.0 1.0 0 0 0 0.19
3 Partner object 2 7043.0 No 3641 - - - - - - - Yes No No 0.30
4 Dependents object 2 7043.0 No 4933 - - - - - - - No No No 0.27
5 tenure int64 73 7043.0 1.0 613 32.371149 24.559481 0.0 9.0 29.0 55.0 72.0 1 34 2 1.78
6 PhoneService object 2 7043.0 Yes 6361 - - - - - - - No Yes Yes 0.14
7 MultipleLines object 3 7043.0 No 3390 - - - - - - - No phone service No No 0.41
8 InternetService object 3 7043.0 Fiber optic 3096 - - - - - - - DSL DSL DSL 0.46
9 OnlineSecurity object 3 7043.0 No 3498 - - - - - - - No Yes Yes 0.45
10 OnlineBackup object 3 7043.0 No 3088 - - - - - - - Yes No Yes 0.46
11 DeviceProtection object 3 7043.0 No 3095 - - - - - - - No Yes No 0.46
12 TechSupport object 3 7043.0 No 3473 - - - - - - - No No No 0.45
13 StreamingTV object 3 7043.0 No 2810 - - - - - - - No No No 0.46
14 StreamingMovies object 3 7043.0 No 2785 - - - - - - - No No No 0.46
15 Contract object 3 7043.0 Month-to-month 3875 - - - - - - - Month-to-month One year Month-to-month 0.43
16 PaperlessBilling object 2 7043.0 Yes 4171 - - - - - - - Yes No Yes 0.29
17 PaymentMethod object 4 7043.0 Electronic check 2365 - - - - - - - Electronic check Mailed check Mailed check 0.59
18 MonthlyCharges float64 1585 7043.0 20.05 61 64.761692 30.090047 18.25 35.5 70.35 89.85 118.75 29.85 56.95 53.85 3.02
19 TotalCharges object 6531 7043.0 11 - - - - - - - 29.85 1889.5 108.15 3.80
20 Churn object 2 7043.0 No 5174 - - - - - - - No No Yes 0.25

Insights:ΒΆ

  • customerID is unique, which means that we have data from 7043 different customers.
  • TotalCharges, most of its values are unique. It has a wrong data types, and the top frequent value appear empty (' ').
  • Our dataset has 21 columns and 7043 rows.

Spotting missing valuesΒΆ

column_name dtypes isnull value: ' ' value: 'No apply' value: 'empty' total_missing percent_missing
0 customerID object 0 0.0 0.0 0.0 0.0 0.00%
1 gender object 0 0.0 0.0 0.0 0.0 0.00%
2 SeniorCitizen int64 0 0.0 0.0 0.0 0.0 0.00%
3 Partner object 0 0.0 0.0 0.0 0.0 0.00%
4 Dependents object 0 0.0 0.0 0.0 0.0 0.00%
5 tenure int64 0 0.0 0.0 0.0 0.0 0.00%
6 PhoneService object 0 0.0 0.0 0.0 0.0 0.00%
7 MultipleLines object 0 0.0 0.0 0.0 0.0 0.00%
8 InternetService object 0 0.0 0.0 0.0 0.0 0.00%
9 OnlineSecurity object 0 0.0 0.0 0.0 0.0 0.00%
10 OnlineBackup object 0 0.0 0.0 0.0 0.0 0.00%
11 DeviceProtection object 0 0.0 0.0 0.0 0.0 0.00%
12 TechSupport object 0 0.0 0.0 0.0 0.0 0.00%
13 StreamingTV object 0 0.0 0.0 0.0 0.0 0.00%
14 StreamingMovies object 0 0.0 0.0 0.0 0.0 0.00%
15 Contract object 0 0.0 0.0 0.0 0.0 0.00%
16 PaperlessBilling object 0 0.0 0.0 0.0 0.0 0.00%
17 PaymentMethod object 0 0.0 0.0 0.0 0.0 0.00%
18 MonthlyCharges float64 0 0.0 0.0 0.0 0.0 0.00%
19 TotalCharges object 0 11.0 0.0 0.0 11.0 0.16%
20 Churn object 0 0.0 0.0 0.0 0.0 0.00%

Insights:ΒΆ

  • Even though there is not missing values such as np.nan, there are strings values such as ' ' that indicate missing values.
  • According to the previous table, the only column with missing values is TotalCharges. Since it represents 0.16% of the data, the plan of action will be to drop those 11 rows.
  • customerID could be dropped. At this stage, it seems that the id code nomenclature does not hold any hidden information.
column_name dtypes isnull value: ' ' value: 'No apply' value: 'empty' total_missing percent_missing
0 gender object 0 0.0 0.0 0.0 0.0 0.00%
1 SeniorCitizen int64 0 0.0 0.0 0.0 0.0 0.00%
2 Partner object 0 0.0 0.0 0.0 0.0 0.00%
3 Dependents object 0 0.0 0.0 0.0 0.0 0.00%
4 tenure int64 0 0.0 0.0 0.0 0.0 0.00%
5 PhoneService object 0 0.0 0.0 0.0 0.0 0.00%
6 MultipleLines object 0 0.0 0.0 0.0 0.0 0.00%
7 InternetService object 0 0.0 0.0 0.0 0.0 0.00%
8 OnlineSecurity object 0 0.0 0.0 0.0 0.0 0.00%
9 OnlineBackup object 0 0.0 0.0 0.0 0.0 0.00%
10 DeviceProtection object 0 0.0 0.0 0.0 0.0 0.00%
11 TechSupport object 0 0.0 0.0 0.0 0.0 0.00%
12 StreamingTV object 0 0.0 0.0 0.0 0.0 0.00%
13 StreamingMovies object 0 0.0 0.0 0.0 0.0 0.00%
14 Contract object 0 0.0 0.0 0.0 0.0 0.00%
15 PaperlessBilling object 0 0.0 0.0 0.0 0.0 0.00%
16 PaymentMethod object 0 0.0 0.0 0.0 0.0 0.00%
17 MonthlyCharges float64 0 0.0 0.0 0.0 0.0 0.00%
18 TotalCharges float64 0 0.0 0.0 0.0 0.0 0.00%
19 Churn object 0 0.0 0.0 0.0 0.0 0.00%

InsightsΒΆ

  • Previous tables shows that there is not missing values. It possible to proceed with data visualization.

Data VisualizationΒΆ

InsigthsΒΆ

  • The dependent variable is Churn. In general, 26.6% of customer churn, while the company was able to retain 73.4% of the customers.
  • Since the objective is to predict customer churn, this dataset is unbalanced.

VIZ - Demographic evaluationΒΆ

InsightsΒΆ

  • Sex distribution among male and female are similar. It is not showing a clear difference between churn and retained customers.
  • Curiously, customers over 65 years old are more likely to churn (around 42% of churn rate). It could be related to age issues.
  • In general, the number of customers with or without a partner is very similar. On the other hand, most customers do not have dependents (around 70% of them). The interesting point is that customers without a partner or dependents are more likely to churn (approximately 33% and 31%, respectively)

VIZ - Service evaluationΒΆ

InsightsΒΆ

  • Even though most customers have phone service (around 90%), customer with or without service have a similar rate churn (around 25% to 27%).
  • Customers with multiline phone services are slightly more likely to churn, around 29%, compared to the 25% churn rate for customers without multiline services.
  • Around 78% of customer has internet service (fiber optic and DSL). Interestingly, customers with Fiber optic service are more likely to churn, with approximately a churn rate of 42%.
  • Figures show that customers who do not complement their internet service with online security, online backup, device protection, or tech support are more likely to churn. The churn rate for those customers ranges between 39% to 42%. Additionally, the churn rate for customers without or with services such as streaming tv and streaming movies are similar, which is between 30% to 34% churn.

VIZ - Customer account informationΒΆ

InsightsΒΆ

  • Turning to customer service information, 55% of customers have a month-to-month contract with a high churn rate of approximately 43%, which is relatively high compared to 11% churn of customers with one-year contracts or the roughly 3% churn for customers with two-year contracts.

  • Including tenure in the analysis or contract and churn, the histogram shows that the greater the customer commitment, the later the customer service cancelation is. Around 62% of the month-to-month customers churn within the first year. In contrast to month-to-month customers, customers with one-year or two-year contracts seem to delay their churn. Around 70% of One-year contract customers cancel their service after three years, and 65% of two-year contract customers cancel their service after five years.

  • In general, comparing the churn rate per month, we can see that the more time the client is in the company, the lower the customer attrition will be.

  • Interestingly, 59% of customers prefer paperless billing; however, in terms of churn, paperless billing represents a 33% churn compared to other forms of billing that represent 16% churn.

  • Slightly over a third of the customers use electronic checks as preferred payment methods; however, it has a 45% churn. Other payment methods (such as mailed checks, automatic payment with bank transfer, and credit cards) present a churn rate of around 15% to 19%.

  • Around 62% of customers cancel their services the first month, representing 380 customers. After the first three months, the churn rate changes to be lower than the retention rate of customers, and for each month that the company keeps the client, the company can reduce the churn rate up to around 2% (customer with six years).

  • Additionally, according to the boxplot, it seems that customers with higher monthly charges are more likely to churn than those who have to pay lower monthly bills. This difference is accentuated for those customers with a longer time in the company.

  • As was expected, the total charges per customer increased as the customer tenure increased. However, it is interesting that customers with higher total charges are more likely to churn than customers with lower total charges.

  • Counting multiline as a service, each customer can add up to nine services. According to the bar chart, with around a 45% churn rate, customers with three services are more likely to churn.

  • If we give a deeper look at the service distribution of active services for customers who have selected 3 or 4 services, we can see that the churn rate for customers with Internet service (Fiber optic) reaches up to 59% and 51%, respectively.

Note: For the following steps (data mining models), I will keep the new column 'num_services'.

[NbConvertApp] Making directory str(path_report)
[NbConvertApp] Converting notebook FR_Tsk01Exp01NB01_v1_01_Telco_Churn_part1_eda_plotly.ipynb to html
[NbConvertApp] Writing 4251825 bytes to str(path_report)\FR_Tsk01Exp01NB01_v1_report.html